Predicting U.S. Movie Revenue: An In-Depth ML Analysis of IMDb Data

By: Johann Antisseril, Connor O'Meara, & Jasmine Parekh

Part 1

1.1 Introduction

IMDb is a popular online platform designed to help people review, research, and explore the world of movies. Movies are a large part of many people's lives, and viewers regularly consume content featured on IMDb and similar websites. The film industry is a multi-billion dollar industry that employs thousands of people, all of whom want to see their projects succeed financially. Yet many movies barely earn back their budget, or fail to meet it at all. A team of hundreds can work tirelessly on a movie and pour money into it for years only for it to flop, while another film makes millions of dollars beyond its budget soon after it is released in theaters or on streaming platforms. The film industry is therefore heavily invested in determining which qualities of a movie correlate positively with strong revenue generation. Identifying these money-making qualities is not just a million dollar question, but potentially a multi-billion dollar question. This data exploration and analysis works through the IMDb data, together with separate datasets that rank the popularity of directors, actors, and actresses, to determine what successful movies have in common and whether it is possible to predict the revenue range of a movie from historical data.

1.2 The Dataset

The Main IMDb Dataset - IMDB data from 2006 to 2016:

Found here: https://www.kaggle.com/PromptCloudHQ/imdb-data/code

The IMDb data from 2006 to 2016 we are taking into consideration has been divided into several columns:

  • Rank: Rank of movie (overall popularity)
  • Title: Title of the movie
  • Genre: Category or type of movie (Action, Drama, etc.)
  • Description: A short written segment of the movie plot or concept
  • Director: Director of the movie
  • Actors: Cast list
  • Year: Year released
  • Runtime (Minutes): Runtime of movie in minutes
  • Rating: Net rating given to the movie by users
  • Votes: Number of votes by users
  • Revenue (Millions): Revenue generated by movie in millions of dollars
  • Metascore: Weighted score assigned based off of critic reviews, each critic rating weighted by their fame

1.3 Further Resources

We drew on two additional sources, IMDb lists and YouGovAmerica (YGA) polls, to create new attribute columns.

As part of our ML analysis, we introduce new attribute columns that measure the popularity of directors, actors, actresses, and genres, in order to gauge how much a cast, a director, and a genre contribute to the revenue generated by a given movie. We did this because the original columns are qualitative and cannot be used directly in our ML algorithms. We therefore drew on separate datasets to assign a rank or point value to each director and actor, which we average to create quantitative values for the ML models.

Below are the keys created from the two main sources, IMDb lists and YouGovAmerica polls, which rank directors and actors. Lastly, the genre key was created by summing the revenue of all movies in each genre and ranking the genres by total gross:

  • IMDb Cast Key:
    • Name: Name of actor/actress
    • Points: Essentially a marker of popularity and industry presence
  • IMDb Director Key:
    • Name: Name of director
    • Points: Essentially a marker of popularity and industry presence
  • YGA Cast Key:
    • Name: Name of cast member
    • Rank: Rank 1-1460
    • Fame(%): Metric of fame determined by YGA
    • Popularity(%): Metric of popularity determined by YGA
  • YGA Director Key:
    • Name: Name of director
    • Rank: Rank 1-140
    • Fame(%): Metric of fame determined by YGA
    • Popularity(%): Metric of popularity determined by YGA

These keys are used to create the new attribute columns. For each movie, we loop through its actors, look up their associated ranks/points, and average them over the group size. For directors, we simply look up the associated rank/points. This produces the following columns:

  • Revenue Per IMDb Average Cast Points

  • Revenue Per YGA Average Cast Ranks

  • Revenue Per IMDb Average Director Points

  • Revenue Per YGA Director Rank

  • Revenue Per Average Genre Rank
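
The lookup-and-average step described above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual implementation: the names, the point values, and the `average_cast_points` helper are hypothetical, and it assumes that an actor missing from the key contributes 0 points while the sum is still divided by the full group size.

```python
import pandas as pd

# Hypothetical key: names and point values are illustrative only.
cast_key = pd.DataFrame({
    "Name": ["Actor A", "Actor B", "Actor C"],
    "Points": [900, 600, 300],
})
points_by_name = cast_key.set_index("Name")["Points"]

def average_cast_points(actors, key=points_by_name):
    """Sum the key points of a cast and divide by the group size.

    Assumption: actors that are not in the key contribute 0 points.
    """
    if not actors:
        return float("nan")
    return sum(key.get(name, 0) for name in actors) / len(actors)

movies = pd.DataFrame({
    "Title": ["Movie 1", "Movie 2"],
    "Actors": [["Actor A", "Actor B", "Unlisted Actor"], ["Unlisted Actor"]],
})
movies["Average Cast Points"] = movies["Actors"].apply(average_cast_points)
print(movies[["Title", "Average Cast Points"]])
```

Counting unlisted actors as 0 pulls the average down for casts with few listed members; averaging over matched actors only would be an equally defensible choice.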

Part 2

2.1 Getting the Data

The data is read from the CSV file into a pandas DataFrame. All rows with a missing revenue value are removed, because revenue is the main measure of success being explored; keeping rows without it does not make sense, and the remaining rows are easier to work with. Each column is then converted to the type it should have. For example, 'Rank', 'Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', and 'Metascore' should all be numeric types, and 'Genre' and 'Actors' should each be a list of strings.

In [1]:
#!pip install pdfplumber
# !pip install plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import re
import pdfplumber
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

#Reads in the IMDB Movie Data CSV file, removes all rows with an empty revenue value, and resets the index
movie_data = pd.read_csv("IMDB-Movie-Data.csv")
movie_data = movie_data[pd.notnull(movie_data['Revenue (Millions)'])].reset_index(drop=True)

#Changes the appropriate columns to numeric types and splits genres and actors into lists of strings
change_types = ['Rank', 'Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']
for i in change_types:
    movie_data[i] = pd.to_numeric(movie_data[i])
movie_data['Genre'] = movie_data['Genre'].apply(lambda x: x.split(','))
#Raw string avoids an invalid escape sequence warning for \s
movie_data['Actors'] = movie_data['Actors'].apply(lambda x: re.split(r',\s?', x))
movie_data
Out[1]:
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy [Action, Adventure, Sci-Fi] A group of intergalactic criminals are forced ... James Gunn [Chris Pratt, Vin Diesel, Bradley Cooper, Zoe ... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus [Adventure, Mystery, Sci-Fi] Following clues to the origin of mankind, a te... Ridley Scott [Noomi Rapace, Logan Marshall-Green, Michael F... 2012 124 7.0 485820 126.46 65.0
2 3 Split [Horror, Thriller] Three girls are kidnapped by a man with a diag... M. Night Shyamalan [James McAvoy, Anya Taylor-Joy, Haley Lu Richa... 2016 117 7.3 157606 138.12 62.0
3 4 Sing [Animation, Comedy, Family] In a city of humanoid animals, a hustling thea... Christophe Lourdelet [Matthew McConaughey, Reese Witherspoon, Seth ... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad [Action, Adventure, Fantasy] A secret government agency recruits some of th... David Ayer [Will Smith, Jared Leto, Margot Robbie, Viola ... 2016 123 6.2 393727 325.02 40.0
5 6 The Great Wall [Action, Adventure, Fantasy] European mercenaries searching for black powde... Yimou Zhang [Matt Damon, Tian Jing, Willem Dafoe, Andy Lau] 2016 103 6.1 56036 45.13 42.0
6 7 La La Land [Comedy, Drama, Music] A jazz pianist falls for an aspiring actress i... Damien Chazelle [Ryan Gosling, Emma Stone, Rosemarie DeWitt, J... 2016 128 8.3 258682 151.06 93.0
7 9 The Lost City of Z [Action, Adventure, Biography] A true-life drama, centering on British explor... James Gray [Charlie Hunnam, Robert Pattinson, Sienna Mill... 2016 141 7.1 7188 8.01 78.0
8 10 Passengers [Adventure, Drama, Romance] A spacecraft traveling to a distant colony pla... Morten Tyldum [Jennifer Lawrence, Chris Pratt, Michael Sheen... 2016 116 7.0 192177 100.01 41.0
9 11 Fantastic Beasts and Where to Find Them [Adventure, Family, Fantasy] The adventures of writer Newt Scamander in New... David Yates [Eddie Redmayne, Katherine Waterston, Alison S... 2016 133 7.5 232072 234.02 66.0
10 12 Hidden Figures [Biography, Drama, History] The story of a team of female African-American... Theodore Melfi [Taraji P. Henson, Octavia Spencer, Janelle Mo... 2016 127 7.8 93103 169.27 74.0
11 13 Rogue One [Action, Adventure, Sci-Fi] The Rebel Alliance makes a risky move to steal... Gareth Edwards [Felicity Jones, Diego Luna, Alan Tudyk, Donni... 2016 133 7.9 323118 532.17 65.0
12 14 Moana [Animation, Adventure, Comedy] In Ancient Polynesia, when a terrible curse in... Ron Clements [Auli'i Cravalho, Dwayne Johnson, Rachel House... 2016 107 7.7 118151 248.75 81.0
13 15 Colossal [Action, Comedy, Drama] Gloria is an out-of-work party girl forced to ... Nacho Vigalondo [Anne Hathaway, Jason Sudeikis, Austin Stowell... 2016 109 6.4 8612 2.87 70.0
14 16 The Secret Life of Pets [Animation, Adventure, Comedy] The quiet life of a terrier named Max is upend... Chris Renaud [Louis C.K., Eric Stonestreet, Kevin Hart, Lak... 2016 87 6.6 120259 368.31 61.0
15 17 Hacksaw Ridge [Biography, Drama, History] WWII American Army Medic Desmond T. Doss, who ... Mel Gibson [Andrew Garfield, Sam Worthington, Luke Bracey... 2016 139 8.2 211760 67.12 71.0
16 18 Jason Bourne [Action, Thriller] The CIA's most dangerous former operative is d... Paul Greengrass [Matt Damon, Tommy Lee Jones, Alicia Vikander,... 2016 123 6.7 150823 162.16 58.0
17 19 Lion [Biography, Drama] A five-year-old Indian boy gets lost on the st... Garth Davis [Dev Patel, Nicole Kidman, Rooney Mara, Sunny ... 2016 118 8.1 102061 51.69 69.0
18 20 Arrival [Drama, Mystery, Sci-Fi] When twelve mysterious spacecraft appear aroun... Denis Villeneuve [Amy Adams, Jeremy Renner, Forest Whitaker, Mi... 2016 116 8.0 340798 100.50 81.0
19 21 Gold [Adventure, Drama, Thriller] Kenny Wells, a prospector desperate for a luck... Stephen Gaghan [Matthew McConaughey, Edgar Ramírez, Bryce Dal... 2016 120 6.7 19053 7.22 49.0
20 22 Manchester by the Sea [Drama] A depressed uncle is asked to take care of his... Kenneth Lonergan [Casey Affleck, Michelle Williams, Kyle Chandl... 2016 137 7.9 134213 47.70 96.0
21 24 Trolls [Animation, Adventure, Comedy] After the Bergens invade Troll Village, Poppy,... Walt Dohrn [Anna Kendrick, Justin Timberlake, Zooey Desch... 2016 92 6.5 38552 153.69 56.0
22 25 Independence Day: Resurgence [Action, Adventure, Sci-Fi] Two decades after the first Independence Day i... Roland Emmerich [Liam Hemsworth, Jeff Goldblum, Bill Pullman, ... 2016 120 5.3 127553 103.14 32.0
23 27 Bahubali: The Beginning [Action, Adventure, Drama] In ancient India, an adventurous and daring ma... S.S. Rajamouli [Prabhas, Rana Daggubati, Anushka Shetty, Tama... 2015 159 8.3 76193 6.50 NaN
24 28 Dead Awake [Horror, Thriller] A young woman must save herself and her friend... Phillip Guzman [Jocelin Donahue, Jesse Bradford, Jesse Borreg... 2016 99 4.7 523 0.01 NaN
25 29 Bad Moms [Comedy] When three overworked and under-appreciated mo... Jon Lucas [Mila Kunis, Kathryn Hahn, Kristen Bell, Chris... 2016 100 6.2 66540 113.08 60.0
26 30 Assassin's Creed [Action, Adventure, Drama] When Callum Lynch explores the memories of his... Justin Kurzel [Michael Fassbender, Marion Cotillard, Jeremy ... 2016 115 5.9 112813 54.65 36.0
27 31 Why Him? [Comedy] A holiday gathering threatens to go off the ra... John Hamburg [Zoey Deutch, James Franco, Tangie Ambrose, Ce... 2016 111 6.3 48123 60.31 39.0
28 32 Nocturnal Animals [Drama, Thriller] A wealthy art gallery owner is haunted by her ... Tom Ford [Amy Adams, Jake Gyllenhaal, Michael Shannon, ... 2016 116 7.5 126030 10.64 67.0
29 33 X-Men: Apocalypse [Action, Adventure, Sci-Fi] After the re-emergence of the world's first mu... Bryan Singer [James McAvoy, Michael Fassbender, Jennifer La... 2016 144 7.1 275510 155.33 52.0
... ... ... ... ... ... ... ... ... ... ... ... ...
842 961 Trance [Crime, Drama, Mystery] An art auctioneer who has become mixed up with... Danny Boyle [James McAvoy, Rosario Dawson, Vincent Cassel,... 2013 101 7.0 97141 2.32 61.0
843 962 Into the Forest [Drama, Sci-Fi, Thriller] After a massive power outage, two sisters lear... Patricia Rozema [Ellen Page, Evan Rachel Wood, Max Minghella, ... 2015 101 5.9 10220 0.01 59.0
844 963 The Other Boleyn Girl [Biography, Drama, History] Two sisters contend for the affection of King ... Justin Chadwick [Natalie Portman, Scarlett Johansson, Eric Ban... 2008 115 6.7 88260 26.81 50.0
845 964 I Spit on Your Grave [Crime, Horror, Thriller] A writer who is brutalized during her cabin re... Steven R. Monroe [Sarah Butler, Jeff Branson, Andrew Howard, Da... 2010 108 6.3 60133 0.09 27.0
846 968 The Walk [Adventure, Biography, Crime] In 1974, high-wire artist Philippe Petit recru... Robert Zemeckis [Joseph Gordon-Levitt, Charlotte Le Bon, Guill... 2015 123 7.3 92378 10.14 NaN
847 970 The Lone Ranger [Action, Adventure, Western] Native American warrior Tonto recounts the unt... Gore Verbinski [Johnny Depp, Armie Hammer, William Fichtner, ... 2013 150 6.5 190855 89.29 NaN
848 971 Texas Chainsaw 3D [Horror, Thriller] A young woman travels to Texas to collect an i... John Luessenhop [Alexandra Daddario, Tania Raymonde, Scott Eas... 2013 92 4.8 37060 34.33 62.0
849 972 Disturbia [Drama, Mystery, Thriller] A teen living under house arrest becomes convi... D.J. Caruso [Shia LaBeouf, David Morse, Carrie-Anne Moss, ... 2007 105 6.9 193491 80.05 NaN
850 973 Rock of Ages [Comedy, Drama, Musical] A small town girl and a city boy meet on the S... Adam Shankman [Julianne Hough, Diego Boneta, Tom Cruise, Ale... 2012 123 5.9 64513 38.51 47.0
851 974 Scream 4 [Horror, Mystery] Ten years have passed, and Sidney Prescott, wh... Wes Craven [Neve Campbell, Courteney Cox, David Arquette,... 2011 111 6.2 108544 38.18 52.0
852 975 Queen of Katwe [Biography, Drama, Sport] A Ugandan girl sees her world rapidly change a... Mira Nair [Madina Nalwanga, David Oyelowo, Lupita Nyong'... 2016 124 7.4 6753 8.81 73.0
853 976 My Big Fat Greek Wedding 2 [Comedy, Family, Romance] A Portokalos family secret brings the beloved ... Kirk Jones [Nia Vardalos, John Corbett, Michael Constanti... 2016 94 6.0 20966 59.57 37.0
854 980 The Skin I Live In [Drama, Thriller] A brilliant plastic surgeon, haunted by past t... Pedro Almodóvar [Antonio Banderas, Elena Anaya, Jan Cornet, Ma... 2011 120 7.6 108772 3.19 70.0
855 981 Miracles from Heaven [Biography, Drama, Family] A young girl suffering from a rare digestive d... Patricia Riggen [Jennifer Garner, Kylie Rogers, Martin Henders... 2016 109 7.0 12048 61.69 44.0
856 982 Annie [Comedy, Drama, Family] A foster kid, who lives with her mean foster m... Will Gluck [Quvenzhané Wallis, Cameron Diaz, Jamie Foxx, ... 2014 118 5.3 27312 85.91 33.0
857 983 Across the Universe [Drama, Fantasy, Musical] The music of the Beatles and the Vietnam War f... Julie Taymor [Evan Rachel Wood, Jim Sturgess, Joe Anderson,... 2007 133 7.4 95172 24.34 56.0
858 984 Let's Be Cops [Comedy] Two struggling pals dress as police officers f... Luke Greenfield [Jake Johnson, Damon Wayans Jr., Rob Riggle, N... 2014 104 6.5 112729 82.39 30.0
859 985 Max [Adventure, Family] A Malinois dog that helped American Marines in... Boaz Yakin [Thomas Haden Church, Josh Wiggins, Luke Klein... 2015 111 6.8 21405 42.65 47.0
860 986 Your Highness [Adventure, Comedy, Fantasy] When Prince Fabious's bride is kidnapped, he g... David Gordon Green [Danny McBride, Natalie Portman, James Franco,... 2011 102 5.6 87904 21.56 31.0
861 987 Final Destination 5 [Horror, Thriller] Survivors of a suspension-bridge collapse lear... Steven Quale [Nicholas D'Agosto, Emma Bell, Arlen Escarpeta... 2011 92 5.9 88000 42.58 50.0
862 988 Endless Love [Drama, Romance] The story of a privileged girl and a charismat... Shana Feste [Gabriella Wilde, Alex Pettyfer, Bruce Greenwo... 2014 104 6.3 33688 23.39 30.0
863 990 Selma [Biography, Drama, History] A chronicle of Martin Luther King's campaign t... Ava DuVernay [David Oyelowo, Carmen Ejogo, Tim Roth, Lorrai... 2014 128 7.5 67637 52.07 NaN
864 991 Underworld: Rise of the Lycans [Action, Adventure, Fantasy] An origins story centered on the centuries-old... Patrick Tatopoulos [Rhona Mitra, Michael Sheen, Bill Nighy, Steve... 2009 92 6.6 129708 45.80 44.0
865 992 Taare Zameen Par [Drama, Family, Music] An eight-year-old boy is thought to be a lazy ... Aamir Khan [Darsheel Safary, Aamir Khan, Tanay Chheda, Sa... 2007 165 8.5 102697 1.20 42.0
866 993 Take Me Home Tonight [Comedy, Drama, Romance] Four years after graduation, an awkward high s... Michael Dowse [Topher Grace, Anna Faris, Dan Fogler, Teresa ... 2011 97 6.3 45419 6.92 NaN
867 994 Resident Evil: Afterlife [Action, Adventure, Horror] While still out to destroy the evil Umbrella C... Paul W.S. Anderson [Milla Jovovich, Ali Larter, Wentworth Miller,... 2010 97 5.9 140900 60.13 37.0
868 995 Project X [Comedy] 3 high school seniors throw a birthday party t... Nima Nourizadeh [Thomas Mann, Oliver Cooper, Jonathan Daniel B... 2012 88 6.7 164088 54.72 48.0
869 997 Hostel: Part II [Horror] Three American college students studying abroa... Eli Roth [Lauren German, Heather Matarazzo, Bijou Phill... 2007 94 5.5 73152 17.54 46.0
870 998 Step Up 2: The Streets [Drama, Music, Romance] Romantic sparks occur between two dance studen... Jon M. Chu [Robert Hoffman, Briana Evigan, Cassie Ventura... 2008 98 6.2 70699 58.01 50.0
871 1000 Nine Lives [Comedy, Family, Fantasy] A stuffy businessman finds himself trapped ins... Barry Sonnenfeld [Kevin Spacey, Jennifer Garner, Robbie Amell, ... 2016 87 5.3 12435 19.64 11.0

872 rows × 12 columns

2.2 Revenue by Genre

The first aspect explored was whether one genre clearly generated the most revenue. To do this, a dictionary was used to keep track of each genre's total revenue. Because a movie can have multiple genres, each of its genres was credited with the movie's entire revenue. The totals were then put into a pie chart to visualize the data.

In [2]:
#Dictionary to get each genre's total revenue
genre_list = {}
total_sum = sum(movie_data['Revenue (Millions)'])

#Goes through and gets each genre's revenue and adds it to the dictionary
for i in range(len(movie_data['Genre'])):
    for j in movie_data.loc[i,'Genre']:
        if j in genre_list:
            genre_list[j] = genre_list[j] + movie_data.loc[i,'Revenue (Millions)']
        else:
            genre_list[j] = movie_data.loc[i,'Revenue (Millions)']  

#DataFrame created holding each genre and their total revenue for the pie chart
df = pd.DataFrame()
df['Genres']=genre_list.keys()
df['Revenue'] = genre_list.values()

#Plots the dataframe as a pie chart and displays the percent, label name and genres
fig = px.pie(df, values='Revenue', names='Genres', title='Revenue By Genre')
fig.update_traces(textinfo='percent+label+value', textposition='inside')
fig.show()

From this graph it is clear that the largest genre is Adventure, which generated around $38,852.61 million. It accounted for about 19.5% of the revenue attributed across all genres. The top three genres are Adventure, Action, and Drama, and together they account for nearly half of that total. According to the data, a movie that includes one or more of these top three genres is therefore more likely to generate high revenue, as the trend in the pie chart suggests.
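
As a compact alternative to the dictionary loop, the same per-genre tally can be sketched with `DataFrame.explode` and `groupby`. The data below is made up for illustration. Because a multi-genre movie is counted once per genre, the attributed total is larger than the raw revenue total, and the pie chart percentages are shares of the attributed total.

```python
import pandas as pd

# Made-up movies; 'Genre' holds a list of strings per row, as in movie_data.
movies = pd.DataFrame({
    "Genre": [["Action", "Adventure"], ["Drama"], ["Action", "Drama"]],
    "Revenue (Millions)": [100.0, 20.0, 50.0],
})

# One row per (movie, genre) pair, then sum revenue within each genre.
genre_revenue = (
    movies.explode("Genre")
          .groupby("Genre")["Revenue (Millions)"]
          .sum()
          .sort_values(ascending=False)
)

# Share of the attributed total, which is what a pie chart displays.
genre_share = genre_revenue / genre_revenue.sum() * 100

print(genre_revenue)
print(genre_share.round(1))
```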

2.3 Revenue Per Year

The next aspect explored was how revenue changed over the years and whether a conclusion could be drawn from that. For each year, the revenues of all movies released that year were added together and plotted on a line graph to see if there is a clear trend.

In [3]:
revenue = []
for i in np.unique(movie_data['Year']):
    revenue.append(sum(movie_data[movie_data['Year'] == i]['Revenue (Millions)']))
plt.plot(np.unique(movie_data['Year']), revenue)
plt.xlabel('Years')
plt.ylabel('Revenue (Millions)')
plt.title('Revenue Per Year')
plt.show()

The line graph above shows that the revenue generated generally increases as the years go on. There was an especially large increase from 2015 to 2016. There is also a significant dip in revenue between 2010 and 2011, followed by a sharp increase the year after. Despite these fluctuations, there is a clear general increase in revenue over the years.
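
The yearly totals can also be computed with a `groupby`, which makes it easy to inspect year-over-year changes and how many movies contribute to each total. The sketch below uses made-up numbers; the variable names are illustrative.

```python
import pandas as pd

# Made-up data with the same column names as movie_data.
movies = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2008],
    "Revenue (Millions)": [10.0, 5.0, 30.0, 20.0],
})

# Total revenue per year, the quantity plotted in the line graph.
revenue_per_year = movies.groupby("Year")["Revenue (Millions)"].sum()

# Year-over-year change; the first year has no predecessor, so it is NaN.
yearly_change = revenue_per_year.diff()

# Number of movies per year, useful context when comparing totals.
movies_per_year = movies.groupby("Year").size()

print(revenue_per_year)
print(yearly_change)
print(movies_per_year)
```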

2.4 Average Rating Per Year

Another factor to consider is how ratings changed over the years, as an indication of how well movies were generally received. To do this, the ratings of all movies released in each year were averaged and plotted on a scatterplot. To help show whether there is a general trend, a linear regression line was added, calculated with NumPy's polyfit function.

In [4]:
#Gets the unique years to be plotted
x = np.unique(movie_data['Year'])
y = []
for i in x:
    temp = movie_data[movie_data['Year'] == i]
    y.append(sum(temp['Rating'])/len(temp))
    plt.scatter(i, y[-1], c = 'b')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, c = 'r')
plt.xlabel('Years')
plt.ylabel('Ratings')
plt.title('Average Ratings Per Year')
plt.show()

The graph above shows a general downward trend between 2006 and 2016: as the years go on, the average rating decreases. The decrease is not large, however. The average rating dropped by less than one point, which indicates that ratings have not fallen significantly even though the overall trend is downward.
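
To make the regression line concrete, the sketch below fits a degree-1 polynomial with `np.polyfit` on made-up yearly averages and reads the trend off the slope. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Made-up average ratings for five consecutive years.
years = np.array([2006, 2007, 2008, 2009, 2010])
avg_rating = np.array([7.0, 6.9, 6.9, 6.8, 6.7])

# Degree-1 fit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(years, avg_rating, 1)

# Change in the fitted rating over the whole period.
total_change = slope * (years[-1] - years[0])

print(f"slope per year: {slope:.3f}")
print(f"fitted change over the period: {total_change:.2f}")
```

A negative slope corresponds to a downward trend, and multiplying it by the number of years gives the size of the overall drop along the fitted line.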

2.5 Revenue Per Ratings

Having explored how revenue and ratings each changed per year, the next step is to see whether there is a relationship between ratings and revenue. The dataset contains two types of ratings. The first is the one previously explored, in the column labeled 'Rating'. The other is the Metascore, a rating taken from the Metacritic website. The relationship of both ratings to revenue is explored in this section. To build the graph, each movie is placed into a rating category, which can be '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', or "No Data". To assign a category, each rating is floored, i.e. 4.3 and 4.9 would both fall in the 4 category. Since the Metascore is out of 100, it is first divided by 10 to bring it onto the same scale as 'Rating' and then floored. The two sets of totals are plotted side by side as a bar graph for ease of comparison. It is important to note that each category in this graph is a range (4 spans 4.0-4.9).

In [5]:
import math
rev_per_rating = [0] * 12
rev_per_meta = [0] * 12
for index in range(len(list(movie_data['Rating']))):
    rev_per_rating[math.floor(movie_data.loc[index,'Rating'])] += movie_data.loc[index,'Revenue (Millions)']
    if math.isnan(movie_data.loc[index,'Metascore']):
        rev_per_meta[-1] += movie_data.loc[index,'Revenue (Millions)']
    else:
        rev_per_meta[math.floor((movie_data.loc[index,'Metascore'])/10)] += movie_data.loc[index,'Revenue (Millions)']
plt.bar(np.arange(12), rev_per_rating, 0.35, color = 'b', label = "Ratings")
plt.bar(np.arange(12)+0.35, rev_per_meta, 0.35, color = 'r', label = "Metascores")
plt.xlabel('Ratings')
plt.ylabel('Revenue (Millions)')
plt.title('Revenue Per Ratings')
plt.xticks(np.arange(12)+0.35 / 2,['0','1','2','3','4','5','6','7','8','9','10',"No Data"])
plt.legend(loc = 'best')
plt.show()

From this bar graph, it can clearly be seen that for 'Rating' most of the revenue falls in the 6 to 8 range (inclusive), whereas for 'Metascore' the bars form more of a curve, with most of the revenue falling in the 4 to 8 range (inclusive). Movies rated between 6 and 7 generated the most total revenue for both types of rating. So even though other movies received higher ratings, their total revenue was still lower than that of the movies in the 6-7 range.
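
The bucketing rule can be sketched on a few made-up rows: 'Rating' is floored directly, 'Metascore' is divided by 10 and floored, and a missing Metascore goes to a "No Data" bucket. The helper name `meta_bucket` and the bucket column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; one Metascore is missing on purpose.
movies = pd.DataFrame({
    "Rating": [6.2, 6.9, 7.5, 8.1],
    "Metascore": [65.0, np.nan, 48.0, 81.0],
    "Revenue (Millions)": [100.0, 50.0, 30.0, 20.0],
})

# Floor the 0-10 rating, so 6.2 and 6.9 both land in bucket 6.
movies["Rating bucket"] = np.floor(movies["Rating"]).astype(int)

def meta_bucket(score):
    """Scale a 0-100 Metascore to 0-10 and floor it; NaN becomes 'No Data'."""
    return "No Data" if pd.isna(score) else str(int(score // 10))

movies["Metascore bucket"] = movies["Metascore"].apply(meta_bucket)

# Total revenue per bucket for each type of rating.
rev_by_rating = movies.groupby("Rating bucket")["Revenue (Millions)"].sum()
rev_by_meta = movies.groupby("Metascore bucket")["Revenue (Millions)"].sum()

print(rev_by_rating)
print(rev_by_meta)
```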

2.6.1 Revenues Per Top 50 Directors

Another factor that could contribute to the success of a movie is its director. To analyze this, the total revenue of each director's movies is computed. The directors are then sorted in descending order, so that the director whose movies generated the most total revenue is first and the director whose movies generated the least is last. Only the top 50 are plotted, to show the relationship between successful directors and revenue.

In [6]:
director_list = {}
for i in range(len(movie_data['Director'])):
    director = movie_data.loc[i,'Director']
    if director in director_list:
        director_list[director] = director_list[director] + movie_data.loc[i,'Revenue (Millions)']
    else:
        director_list[director] = movie_data.loc[i,'Revenue (Millions)']
sorted_directors = sorted(director_list.items(), key=lambda x: x[1], reverse = True)
sorted_directors = sorted_directors[:50]
for x,i in enumerate(sorted_directors):
    plt.scatter(x,i[1], label = i[0] + " = " + str(round(i[1],2)) + " million")
plt.title("Revenues For the Top 50 Directors")
plt.xlabel("Directors")
plt.ylabel("Revenue (Millions)")
plt.legend(bbox_to_anchor=(1.0, 1.0), title="Legend", loc='upper left')
plt.show()

In the graph above, the general shape resembles an exponential decrease. The differences in revenue among the few highest-grossing directors are large compared to the differences among the lowest-grossing directors in the top 50. This suggests that certain directors are more successful than others and may therefore increase the success of a movie. It is also important to note that this does not take into consideration the number of movies each director made, which could influence the revenue generated.

2.6.2 Average Revenue Per Director

As mentioned in 2.6.1, the revenue that each director generated could be influenced by the number of movies they made. For example, a director with one movie that generated a lot of revenue and another that generated almost none could be listed higher than a director with a single movie that generated a decent amount of money. Dividing each director's total revenue by the number of movies they made gives a better picture of how successful each director was, which can then be analyzed as a possible factor in what makes a movie successful. As in 2.6.1, the revenue of each director's movies is summed, but this time the sum is divided by the number of movies the director made. The results are then sorted in descending order and the top 50 are plotted.

In [7]:
d_list = {}
for i in np.unique(movie_data['Director']):
    tmp = movie_data[movie_data['Director'] == i]
    d_list[i] = sum(tmp['Revenue (Millions)'])/len(tmp)
sorted_directors = sorted(d_list.items(), key=lambda x: x[1], reverse = True)
sorted_directors = sorted_directors[:50]
for x,i in enumerate(sorted_directors):
    plt.scatter(x,i[1], label = i[0] + " = " + str(round(i[1],2)) + " million")
plt.title("Average Revenues For the Top 50 Directors")
plt.xlabel("Directors")
plt.ylabel("Average Revenue (Millions)")
plt.legend(bbox_to_anchor=(1.0, 1.0), title="Legend", loc='upper left')
plt.show()

As this graph shows, the top directors have changed compared to the previous graph in 2.6.1. The graph still has an exponential-decrease shape, but it now gives a better picture of which directors are most successful and how much their movies averaged. The differences among the leading directors in the top 50 are large, which possibly suggests that these directors direct more successful movies than the directors below them.
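
The difference between the two rankings in 2.6.1 and 2.6.2 can be illustrated with a single `groupby` that computes the total, the average, and the movie count per director. The data is made up and the director names are placeholders.

```python
import pandas as pd

# Made-up data: Director A has three movies, Director B has one.
movies = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director A", "Director B"],
    "Revenue (Millions)": [200.0, 10.0, 30.0, 150.0],
})

# Named aggregation: one output column per statistic.
director_stats = (
    movies.groupby("Director")["Revenue (Millions)"]
          .agg(total="sum", average="mean", movies="count")
)

top_by_total = director_stats["total"].idxmax()
top_by_average = director_stats["average"].idxmax()

print(director_stats)
print("Top by total:", top_by_total)
print("Top by average:", top_by_average)
```

Here the director with more movies leads on total revenue while the director with a single strong movie leads on average revenue, which is why the ordering changes between the two graphs.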

2.7 Creating the Keys - Cross-referencing data to make supplementary columns

This section creates the key t-charts that match actors, actresses, directors, and genres to their rank or points in the two sources, IMDB and YGA (YouGovAmerica). These keys are used to go through each movie's list of actors, its director, and its list of genres in order to average or look up their points/rank. Essentially, qualitative data such as names and genre types becomes quantitative through these keys, which can hopefully help with ML prediction.

2.7.1 Creating Keys from the IMDB lists

These are the two key DataFrames produced from these datasets:

The CSVs are located in the GitHub repository.

In [8]:
'''
Read the IMDB datasets in and create two keys - one for actors and one for directors.
Each director/actor is now listed along with a point value representing their fame according to IMDB.
'''

actor_data = pd.read_csv("imdb_actors.csv", encoding='latin-1')
actress_data = pd.read_csv("imdb_actresses.csv", encoding='latin-1')
director_data = pd.read_csv("imdb_directors.csv", encoding='latin-1')

def to_num(lst):
    #Extracts the first run of digits from each entry and converts it to an int
    ret = []
    for i in range(len(lst)):
        x = re.search(r"(\d+)[^\d]*", lst[i])
        ret.append(int(x.group(1)))
    return ret

actor_data['Description'] = to_num(actor_data['Description'])
actor_data.rename(columns={"Description": "Points"}, inplace = True)
actor_data = actor_data.filter(['Points', 'Name'])

actress_data['Description'] = to_num(actress_data['Description'])
actress_data.rename(columns={"Description": "Points"}, inplace = True)
actress_data = actress_data.filter(['Points', 'Name'])

imdb_cast_key = pd.concat([actor_data, actress_data])
imdb_cast_key = imdb_cast_key.sort_values('Points', ascending = False).reset_index(drop=True)

director_data['Description'] = to_num(director_data['Description'])
director_data.rename(columns={"Description": "Points"}, inplace = True)
imdb_director_key = director_data.filter(['Points', 'Name'])

display(imdb_cast_key)
display(imdb_director_key)
Points Name
0 167820 Morgan Freeman
1 160473 Brad Pitt
2 157865 Leonardo DiCaprio
3 151313 Robert De Niro
4 140716 Matt Damon
5 137003 Michael Caine
6 126932 Christian Bale
7 118663 Tom Hanks
8 112643 Gary Oldman
9 105941 Al Pacino
10 90851 Edward Norton
11 90295 Bruce Willis
12 88197 Harrison Ford
13 81985 Johnny Depp
14 79799 Cillian Murphy
15 76246 Ralph Fiennes
16 73563 Kevin Spacey
17 73401 Samuel L. Jackson
18 72904 Jack Nicholson
19 69889 Tom Cruise
20 68995 Philip Seymour Hoffman
21 66238 Robert Duvall
22 64287 Tom Hardy
23 64117 Natalie Portman
24 63982 Ryan Gosling
25 62549 Steve Buscemi
26 62411 Russell Crowe
27 62334 Liam Neeson
28 62064 Jake Gyllenhaal
29 61094 Joseph Gordon-Levitt
... ... ...
1466 538 Nigel Terry
1467 538 Cherie Lunghi
1468 536 Emma Catherine Rigby
1469 536 Olivier Martinez
1470 536 Tom Payne
1471 533 Peter Simonischek
1472 533 Sandra Hüller
1473 532 Jeanne Moreau
1474 532 Dana Wynter
1475 532 Misa Uehara
1476 527 Asier Etxeandia
1477 523 Rachel Roberts
1478 522 Matt Smith
1479 518 Helena Zengel
1480 517 Jane Wyman
1481 515 John Payne
1482 515 Maureen O'Hara
1483 515 Edmund Gwenn
1484 512 Mia Sara
1485 510 Judith Godrèche
1486 510 Celia Johnson
1487 507 Oliver Masucci
1488 507 Katja Riemann
1489 507 Sidse Babett Knudsen
1490 507 Christoph Maria Herbst
1491 504 Raf Vallone
1492 502 Nadja Uhl
1493 502 Jan Josef Liefers
1494 502 Johanna Wokalek
1495 501 Magali Noël

1496 rows × 2 columns

Points Name
0 143174 Christopher Nolan
1 131784 Steven Spielberg
2 108315 Quentin Tarantino
3 103526 Martin Scorsese
4 83185 David Fincher
5 70150 Ridley Scott
6 63606 Stanley Kubrick
7 63408 Robert Zemeckis
8 58488 Francis Ford Coppola
9 56012 Clint Eastwood
10 52964 Joel Coen
11 52350 Frank Darabont
12 41724 Alfred Hitchcock
13 40712 Sam Mendes
14 35492 James Cameron
15 35149 Danny Boyle
16 32144 Ron Howard
17 30689 Darren Aronofsky
18 30559 Todd Phillips
19 30153 Tim Burton
20 29887 Denis Villeneuve
21 27844 Wes Anderson
22 26241 Paul Greengrass
23 26033 Luc Besson
24 26022 Roman Polanski
25 25665 Alejandro G. Iñárritu
26 25099 Woody Allen
27 24377 Sergio Leone
28 24047 Guy Ritchie
29 23802 Steven Soderbergh
... ... ...
640 555 László Nemes
641 554 Emilio Estevez
642 548 Gillian Armstrong
643 547 Tony Goldwyn
644 546 Francis Veber
645 543 Kornél Mundruczó
646 537 Richard Brooks
647 536 Philipp Stölzl
648 533 Maren Ade
649 533 Tommy Lee Jones
650 532 Henry Hathaway
651 527 Wayne Wang
652 527 Florian Gallenberger
653 526 François Ozon
654 526 Audrey Wells
655 523 Michael Radford
656 518 Michael Crichton
657 516 Gary Sinise
658 515 Mike Leigh
659 515 George Seaton
660 510 Cédric Klapisch
661 508 Ronald Neame
662 507 David Wnendt
663 506 Jean-François Richet
664 505 Richard Fleischer
665 504 Peter Collinson
666 504 Jonas Åkerlund
667 504 Rod Lurie
668 502 Uli Edel
669 500 Peter Greenaway

670 rows × 2 columns

2.7.2 Creating Keys from the YGA pdfs

The following two key dataframes are produced from the YGA datasets.

The PDFs are located in the GitHub repository.

In [9]:
'''
Read the YGA datasets, which are provided as PDFs. Web scraping proved too hard because the website relies heavily
on JavaScript interactions. Several PDF-to-text tools were tested, and pdfplumber proved to be the best.
We use pdfplumber to read the text from each PDF and transfer the rankings into a key dataframe.
Each director/actor is now listed with values representing their fame, as judged subjectively by YGA.
'''

def pdf_to_df(file):
    # Concatenate the text of the first six pages of the PDF
    pdfdump = ""
    with pdfplumber.open(file) as pdf:
        for page in range(0, 6):
            pdfdump += pdf.pages[page].extract_text()
            pdfdump += "\n"

    # Each matching line holds: rank, name, fame %, popularity %
    key = {"Name":[], "Rank":[], "Fame(%)":[], "Popularity(%)":[]}
    for line in pdfdump.splitlines():
        # Raw string so that \d and \s are passed to the regex engine unchanged
        match = re.search(r"(\d+)\s(.*)\s(\d+)%\s(\d+)%", line)
        if match:
            key["Name"].append(match.group(2))
            key["Rank"].append(match.group(1))
            key["Fame(%)"].append(match.group(3))
            key["Popularity(%)"].append(match.group(4))

    return pd.DataFrame(key)



yga_cast_key = pdf_to_df('yga_cast.pdf')
yga_director_key = pdf_to_df('yga_directors.pdf')

display(yga_cast_key)
display(yga_director_key)
Name Rank Fame(%) Popularity(%)
0 Robin Williams 1 96 84
1 Betty White 2 95 82
2 Denzel Washington 3 97 82
3 Morgan Freeman 4 96 80
4 Harrison Ford 5 96 77
5 Samuel L. Jackson 6 93 76
6 Sean Connery 7 92 76
7 Eddie Murphy 8 96 76
8 James Earl Jones 9 87 75
9 Tom Hanks 10 96 75
10 Keanu Reeves 11 95 75
11 Lucille Ball 12 91 74
12 Sandra Bullock 13 94 74
13 Michael J. Fox 14 94 73
14 Audrey Hepburn 15 89 72
15 Bill Murray 16 93 72
16 Clint Eastwood 17 94 72
17 Danny DeVito 18 93 72
18 Patrick Swayze 19 94 71
19 Danny Glover 20 88 71
20 Patrick Stewart 21 88 71
21 Jackie Chan 22 97 71
22 Robert Downey Jr. 23 96 71
23 Dick Van Dyke 24 89 71
24 Al Pacino 25 91 70
25 Leonardo DiCaprio 26 96 70
26 Gene Wilder 27 90 70
27 Liam Neeson 28 90 69
28 Bruce Willis 29 94 69
29 Jack Nicholson 30 92 69
... ... ... ... ...
1431 Millicent Simmonds 1446 30 16
1432 Jacki Weaver 1447 33 16
1433 Paul Walter Hauser 1448 30 16
1434 Robert Donat 1449 29 16
1435 Alfredo Quiroz 1450 29 16
1436 Daniel Bruhl 1451 29 16
1437 Lucas Hedges 1452 33 16
1438 Shahrukh Khan 1453 31 16
1439 Jordan Bolger 1454 31 16
1440 Deobia Oparei 1455 31 16
1441 Jamie Bell 1456 35 16
1442 Mireille Enos 1457 29 16
1443 Joe Taslim 1458 28 16
1444 Harold Russell 1459 32 15
1445 Richard Zeppieri 1460 28 15
1446 Fernanda Montenegro 1461 30 15
1447 Irrfan Khan 1462 30 15
1448 Shohreh Aghdashloo 1463 28 15
1449 Bates Wilder 1464 31 15
1450 Merritt Wever 1465 30 15
1451 Alexis Thorpe 1466 29 14
1452 Dan Petronijevic 1467 29 14
1453 Griffin Gluck 1468 31 14
1454 Joseph Schildkraut 1469 26 13
1455 Harry Maldonado 1470 29 13
1456 Janet McTeer 1471 26 13
1457 Niv Sultan 1472 26 13
1458 Adriana Barraza 1473 30 13
1459 Aleksey Serebryakov 1474 27 12
1460 Mohanlal 1475 26 12

1461 rows × 4 columns

Name Rank Fame(%) Popularity(%)
0 Steven Spielberg 1 92 70
1 Alfred Hitchcock 2 89 70
2 Walt Disney 3 95 70
3 Ron Howard 4 84 66
4 George Lucas 5 88 64
5 Mel Brooks 6 81 61
6 Jim Henson 7 79 58
7 Orson Welles 8 76 58
8 Quentin Tarantino 9 85 55
9 James Cameron 10 81 55
10 Martin Scorsese 11 80 54
11 Tim Burton 12 83 54
12 Penny Marshall 13 75 53
13 Spike Lee 14 89 49
14 Francis Ford Coppola 15 72 48
15 Rob Reiner 16 75 47
16 Wes Craven 17 73 47
17 Garry Marshall 18 65 46
18 Oliver Stone 19 78 46
19 Keenen Ivory Wayans 20 73 45
20 John Carpenter 21 63 44
21 Ingmar Bergman 22 65 44
22 Stanley Kubrick 23 66 43
23 Frank Capra 24 62 43
24 Cecil B. DeMille 25 65 42
25 J. J. Abrams 26 70 42
26 Peter Jackson 27 59 40
27 Ridley Scott 28 61 40
28 John Ford 29 61 39
29 M. Night Shyamalan 30 65 38
... ... ... ... ...
111 Brian Levant 112 33 18
112 Darren Aronofsky 113 33 18
113 Adam Shankman 114 32 18
114 Allen Hughes 116 31 17
115 Noah Baumbach 117 31 17
116 Chris Weitz 118 33 17
117 Abel Ferrara 119 33 17
118 Charles Shyer 120 29 16
119 Steven Zaillian 121 29 16
120 Alan Parker 122 35 16
121 David Twohy 123 30 16
122 Dennis Dugan 124 31 15
123 Brett Ratner 125 37 15
124 Bruce Beresford 126 29 15
125 Alejandro Amenábar 127 28 15
126 Bennett Miller 128 30 15
127 Brad Silberling 129 27 15
128 Alexander Payne 130 30 14
129 Andrew Davis 131 29 14
130 Anthony Minghella 132 27 14
131 John Carney 133 39 14
132 David Dobkin 134 28 14
133 Bob Rafelson 135 27 14
134 Brian Helgeland 136 28 14
135 Anthony McCarten 137 29 13
136 David S. Goyer 138 26 12
137 Courtney Solomon 139 27 12
138 Doug Liman 140 24 11
139 Andy Tennant 141 30 11
140 David Gordon Green 142 25 11

141 rows × 4 columns
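
To illustrate how the regular expression in pdf_to_df splits a line into its four fields, here is a minimal sketch run on a few hand-written lines. The sample strings are assumptions modeled on the tables above, not text extracted from the actual PDFs, and parse_line is a hypothetical helper.

```python
import re

# Same pattern as in pdf_to_df, compiled once.
PATTERN = re.compile(r"(\d+)\s(.*)\s(\d+)%\s(\d+)%")

def parse_line(line):
    """Return (rank, name, fame, popularity), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    rank, name, fame, popularity = match.groups()
    return int(rank), name, int(fame), int(popularity)

# Hypothetical sample lines (assumed layout: rank, name, fame %, popularity %).
samples = [
    "1 Robin Williams 96% 84%",
    "23 Robert Downey Jr. 96% 71%",
    "All-time actors and actresses",  # header-like line with no numbers, skipped
]

parsed = [parse_line(s) for s in samples]
for row in parsed:
    print(row)
```

Because the name group is greedy, multi-word names stay in one field, and lines without the two trailing percentages are ignored.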

2.7.3 Creating the Key for Genre

This key dataframe is produced from the Genre column of the original dataset.

In [10]:
sorted_genres = sorted(genre_list.items(), key=lambda x: x[1], reverse=True)

key = {"Genre":[], "Rank":[], "Total Revenue":[]}
count = 1
for (k, v) in sorted_genres: 
    key['Genre'].append(k)
    key['Rank'].append(count)
    key['Total Revenue'].append(v)
    count += 1
    
genre_key = pd.DataFrame(key)
genre_key
Out[10]:
Genre Rank Total Revenue
0 Adventure 1 38852.61
1 Action 2 35605.42
2 Drama 3 21931.45
3 Comedy 4 19316.45
4 Sci-Fi 5 14910.78
5 Fantasy 6 12262.06
6 Thriller 7 10645.32
7 Animation 8 8987.50
8 Crime 9 8034.62
9 Family 10 6182.61
10 Romance 11 5482.89
11 Mystery 12 4861.86
12 Biography 13 4185.12
13 Horror 14 3413.59
14 History 15 1376.02
15 Sport 16 1040.68
16 Music 17 706.05
17 Western 18 559.12
18 War 19 534.33
19 Musical 20 408.21

2.8 Creating new attribute columns from keys

Now we will use the keys generated above to create five new attribute columns.

In [11]:
'''
TODO: address the 1492 actors/actresses that don't have points associated with them
https://today.yougov.com/ratings/entertainment/popularity/all-time-actors-actresses/all
'''
def avg_lst_keys(x, key_data, lst_label, x_label, y_label):
    # Average the key value (y_label) over the items in the row's list that appear in the key
    lst = x[lst_label]
    total = 0  # named total to avoid shadowing the built-in sum
    count = 0
    for obj in lst:
        matches = key_data.loc[key_data[x_label] == obj, y_label]
        if not matches.empty:
            total += int(matches.iloc[0])
            count += 1
        else:
            missing_obj.add(obj)  # global set, reset before each apply call
    # Rows with no matching item fall back to 0
    return (total/count) if count else 0

def match_key(x, key_data, label):
    # Look up the key value (label) for the row's director; unmatched directors fall back to 0
    matches = key_data.loc[key_data['Name'] == x['Director'], label]
    if not matches.empty:
        return int(matches.iloc[0])
    missing_obj.add(x['Director'])
    return 0
        
In [12]:
missing_obj = set()
movie_data['imdb_avg_cast_pts'] = movie_data.apply(lambda x: avg_lst_keys(x, imdb_cast_key, "Actors", "Name", "Points"), axis = 1)
print("# of IMDB Missing Cast Members: ", len(missing_obj))            
            
missing_obj = set()
movie_data['yga_avg_cast_rank'] = movie_data.apply(lambda x: avg_lst_keys(x, yga_cast_key, "Actors", "Name", "Rank"), axis = 1)
print("# of YGA Missing Cast Members: ", len(missing_obj))       
            
missing_obj = set()
movie_data['imdb_avg_director_pts'] = movie_data.apply(lambda x: match_key(x, imdb_director_key, "Points"), axis = 1)
print("# of IMDB Missing Directors: ", len(missing_obj))
            
missing_obj = set()
movie_data['yga_avg_director_rank'] = movie_data.apply(lambda x: match_key(x, yga_director_key, "Rank"), axis = 1)
print("# of YGA Missing Directors: ", len(missing_obj))

missing_obj = set()
movie_data['genre_avg_rank'] = movie_data.apply(lambda x: avg_lst_keys(x, genre_key, "Genre", "Genre", "Rank"), axis = 1)
print("# of Missing Genres: ", len(missing_obj))
# of IMDB Missing Cast Members:  1104
# of YGA Missing Cast Members:  1081
# of IMDB Missing Directors:  283
# of YGA Missing Directors:  478
# of Missing Genres:  0
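
As a rough illustration of the lookup-and-average step, the sketch below uses a plain dictionary instead of a dataframe. The function average_known and the name "Unknown Actor" are hypothetical; the two point values are taken from the IMDb cast table above.

```python
def average_known(names, lookup):
    """Average lookup[name] over the names that have an entry.

    Returns (average, missing). The average is 0 when no name is found,
    mirroring the fallback used in avg_lst_keys.
    """
    values = [lookup[n] for n in names if n in lookup]
    missing = {n for n in names if n not in lookup}
    average = sum(values) / len(values) if values else 0
    return average, missing

# Small excerpt of a points key.
points = {"Bruce Willis": 90295, "Tom Hardy": 64287}

avg, missing = average_known(["Bruce Willis", "Tom Hardy", "Unknown Actor"], points)
print(avg, missing)
```

One thing to keep in mind when reading the rank columns: a fallback of 0 sits below rank 1, so movies with no matched cast or director look "better" than the top-ranked names unless those rows are treated separately.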

2.8.1

In this part, we explore the relationship between revenue and each of the five new attributes: "IMDB Average Cast Points", "YGA Average Cast Ranks", "IMDB Average Director Points", "YGA Average Director Rank", and "Average Genre Rank". The purpose is to see how each rank or point value compares with the revenue generated.

In [13]:
avgs = ['imdb_avg_cast_pts', 'yga_avg_cast_rank', 'imdb_avg_director_pts', 'yga_avg_director_rank', 'genre_avg_rank']
avg_name = ["IMDB Average Cast Points", "YGA Average Cast Ranks", "IMDB Average Director Points", "YGA Average Director Rank", "Average Genre Rank"]
for i,a in zip(avgs,avg_name):
    # Plot all movies in a single call instead of one scatter call per row
    plt.scatter(movie_data[i], movie_data['Revenue (Millions)'], c = 'b')
    plt.title("Revenue Per " + a)
    plt.xlabel(a)
    plt.ylabel('Revenue (Millions)')
    plt.show()

In general, no graph between rank and revenue (whether it originates from IMDb or YGA, or concerns directors or actors) showed a strong relationship.

Lower rank (1, 2, 3, etc.) -> more popular directors, groups of actors, or groups of genres (rank data for directors and actors comes from the YGA dataset).
Higher points (around 14,000) -> more popular directors or groups of actors (data comes from the IMDb dataset).

We would expect revenue to decrease as rank increases and to increase as points increase. However, the graphs above show that this is not always the case. In many cases, outliers suggest that less popular actors/directors make higher-grossing films and vice versa, which is counterintuitive yet interesting.
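
One way to put a number on "no strong relationship" is a correlation coefficient. The following is a minimal sketch of the Pearson coefficient in plain Python; the two lists are made-up values chosen so the result is easy to check, not figures from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up example: revenue falls steadily as rank gets worse.
ranks = [1, 2, 3, 4, 5]
revenue = [10, 8, 6, 4, 2]
r = pearson(ranks, revenue)
print(round(r, 3))
```

A value near -1 or 1 indicates a strong linear relationship and a value near 0 a weak one. With the notebook's dataframe, the same quantity should be obtainable with pandas' Series.corr, for example movie_data['yga_avg_cast_rank'].corr(movie_data['Revenue (Millions)']).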

2.8.2

This part groups the data into bins to make the analysis easier and conclusions easier to draw. For each attribute, the range from 0 to the maximum value is divided by 100 evenly spaced edges (99 bins), and the average revenue of the movies in each bin is plotted as a single point. A regression line is added to show the general trend of the data.
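
The binning idea can be sketched without pandas as follows. This is an illustrative variant, not the notebook's cell: bin_means is a hypothetical helper, it assumes non-negative x values with a positive maximum, and it uses half-open bins so that a value sitting exactly on an edge is counted once.

```python
def bin_means(xs, ys, n_bins):
    """Average ys within n_bins equal-width bins of xs spanning [0, max(xs)]."""
    upper = max(xs)
    width = upper / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        # Half-open bins [k*width, (k+1)*width); the maximum goes in the last bin.
        k = min(int(x / width), n_bins - 1)
        sums[k] += y
        counts[k] += 1
    # Return (left edge, mean) for non-empty bins only.
    return [(k * width, sums[k] / counts[k]) for k in range(n_bins) if counts[k]]

# Made-up values for illustration.
result = bin_means([0, 1, 2, 3, 4], [10, 20, 30, 40, 50], n_bins=2)
print(result)
```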

In [14]:
for i,a in zip(avgs,avg_name):
    x = np.linspace(0, max(movie_data[i]),100)
    m = []
    n = []
    for j in range(99):
        tempdf = movie_data[(movie_data[i]>=x[j]) & (movie_data[i]<=x[j+1])]
        if sum(tempdf['Revenue (Millions)']) > 0:
            m.append(x[j])
            n.append(sum(tempdf['Revenue (Millions)'])/len(tempdf))
            plt.scatter(m[-1], n[-1], c = 'b')
    line_parts = np.polyfit(m, n, 1)    
    line = np.poly1d(line_parts)
    y_part = line(m)
    plt.plot(m, y_part, c = 'r')
    plt.title("Revenue Per " + a)
    plt.xlabel(a)
    plt.ylabel('Revenue (Millions)')
    plt.show()

As seen in the graphs above,

  • IMDB Average Cast Points (graph 1): The points are concentrated near 0, but there is still a very slight upward trend, suggesting that the higher the cast's average points, the more revenue the movie generates. However, since the correlation is so weak, a firm conclusion cannot be drawn.
  • YGA Average Cast Rank (graph 2): The graph shows a clear relationship between the average YGA cast rank and revenue. The downward trend indicates that movies whose cast has a better average rank (closer to 1) generally had higher revenue than movies whose cast has a worse average rank. This could mean that a cast made up of better-ranked actors is likely to generate more revenue, and vice versa.
  • IMDB Average Director Points (graph 3): This graph does not have enough points to draw a firm conclusion. However, the regression line shows a general upward trend in the data: the higher the director's points, the higher the revenue. This suggests that a director with more points would probably direct a movie with higher revenue.
  • YGA Average Director Rank (graph 4): This graph shows a slight downward trend: the better the director's rank (closer to 1), the higher the revenue. However, since the correlation is so weak, a firm conclusion cannot be drawn.
  • Average Genre Rank (graph 5): The graph shows a clear relationship between the average genre rank and revenue. The downward trend indicates that movies with a better average genre rank (closer to 1) generally had higher revenue than movies with a worse average genre rank. This could mean that a movie in a more popular genre will have higher revenue than one in a less popular genre.

From all the graphs obtained, the results were as expected, but the correlations were weaker than one would expect. Generally, the better the rank and the higher the points, the higher the movie's revenue.

In [15]:
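'''
Bin revenue into four range groups, which serve as the classification target for the ML models.
'''
groups = [1, 2, 3, 4] # The revenue ranges (in millions) represented by these labels are: '0-15', '16-50', '51-115', '116-1000'
bins = [-1, 15, 50, 115, 1000]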

# use cut function and above defined bins/labels to create new group column in master df
movie_data['revenue_groups'] = pd.cut(x= movie_data['Revenue (Millions)'], bins=bins, labels=groups)

movie_data['revenue_groups'] = movie_data['revenue_groups'].astype(int)
print("Split of the revenue into range groups:")
print (movie_data['revenue_groups'].value_counts())
Split of the revenue into range groups:
1    236
4    217
3    212
2    207
Name: revenue_groups, dtype: int64
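
To make the bin boundaries explicit, here is a small plain-Python sketch that assigns the same groups. It assumes the default behavior of pd.cut, where each bin includes its right edge, so the groups are (-1, 15], (15, 50], (50, 115], and (115, 1000]. The function revenue_group is a hypothetical stand-in for illustration.

```python
from bisect import bisect_left

UPPER_EDGES = [15, 50, 115, 1000]  # right-inclusive upper bounds of groups 1-4
LOWER_BOUND = -1                   # exclusive lower bound of group 1

def revenue_group(revenue):
    """Return the group (1-4) for a revenue in (-1, 1000], else None."""
    if revenue <= LOWER_BOUND or revenue > UPPER_EDGES[-1]:
        return None
    return bisect_left(UPPER_EDGES, revenue) + 1

groups_out = [revenue_group(r) for r in (0.0, 15.0, 15.01, 115.0, 333.13, 1500.0)]
print(groups_out)
```

A revenue above 1000 falls outside every bin. In the notebook's cell such a value would become NaN after pd.cut and would not convert cleanly to int, so the upper edge matters if the data is ever extended.
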
In [16]:
'''
Preparing the data for ML by dropping all qualitative columns and filtering out extreme values.
Transitioning over to a new version of the original dataset - movie_data2.
'''

movie_data2 = movie_data.filter(['Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Metascore', 'Revenue (Millions)', 'imdb_avg_cast_pts', 'yga_avg_cast_rank', 'imdb_avg_director_pts', 'yga_avg_director_rank', 'genre_avg_rank', 'revenue_groups'])
movie_data2.replace([np.inf, -np.inf], np.nan, inplace=True) # replace infinite values with NaN
movie_data2.fillna(0, inplace=True) # replace NaN values with 0 in order to create valid integer/float columns for ML
movie_data2 = movie_data2.reset_index(drop=True) # reset_index returns a new dataframe, so assign it back
movie_data2.head()
Out[16]:
Year Runtime (Minutes) Rating Votes Metascore Revenue (Millions) imdb_avg_cast_pts yga_avg_cast_rank imdb_avg_director_pts yga_avg_director_rank genre_avg_rank revenue_groups
0 2014 121 8.1 757074 76.0 333.13 28064.500000 304.500000 0 0 2.666667 4
1 2012 124 7.0 485820 65.0 126.46 23137.333333 784.666667 70150 28 6.000000 4
2 2016 117 7.3 157606 62.0 138.12 12445.000000 1015.000000 18353 30 10.500000 4
3 2016 108 7.2 60545 59.0 270.32 38746.000000 204.250000 0 0 7.333333 4
4 2016 123 6.2 393727 40.0 325.02 34086.750000 465.000000 10153 0 3.000000 4
In [17]:
'''
Setting the ML models and their hyperparameters to be tested. These are the 3 best models that were run; others that were
tested include logistic regression, linear SVM, etc.
'''

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    
    'K - Nearest Neighbors': {'type': KNeighborsClassifier(),
                          'params': [{'n_neighbors': [1, 3, 5, 10, 50], 'leaf_size': [3, 30]}]
                        },
    
    'Decision Tree': {'type': tree.DecisionTreeClassifier(),
                       'params': [{'max_depth': [3, None]}]
                      },
    
    'Random Forest': {'type': RandomForestClassifier(),
                      'params': [{'n_estimators': [500]}]
                     } 
}
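
To get a feel for how much work a grid search does with these settings, the sketch below enumerates the parameter combinations for the K Nearest Neighbors grid. The helper expand_grid is hypothetical and written only for illustration; scikit-learn provides ParameterGrid for the same purpose. The fold count of 10 matches the cv value used in the training cell.

```python
from itertools import product

def expand_grid(grid):
    """Yield one dict per combination of the values in a parameter grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

knn_grid = {'n_neighbors': [1, 3, 5, 10, 50], 'leaf_size': [3, 30]}
combos = list(expand_grid(knn_grid))

n_folds = 10                      # cv = 10 in the training cell
n_fits = len(combos) * n_folds    # model fits during cross-validation
print(len(combos), "combinations,", n_fits, "cross-validation fits")
```

With refit=True, one additional fit on the whole training set follows for the best combination.
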
In [20]:
'''
The main ML training script. First, split the data into training and test sets in an 80-20 split.
Then, create a results dataframe that will eventually showcase the results: accuracy score, precision, and other statistics.
Next, loop through each model declared above and use GridSearchCV to run every combination of the listed
hyperparameters, keep the best-scoring one, and attach those results to the dataframe. All of this is also timed
to track training efficiency. Lastly, create a confusion matrix, calculate statistics from it, and showcase those as
well.
'''
import math
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV 
#!pip install pycm
from pycm import ConfusionMatrix
from sklearn.metrics import accuracy_score


# Defining the X and Y(target) values
X = movie_data2.drop(['revenue_groups','Revenue (Millions)'], axis=1)
Y = movie_data2['revenue_groups']

# Split 80-20 train and test data
X_train, X_test, Y_train, Y_test  = train_test_split(X, Y, test_size=0.2)
print(len(X_train),'samples in training data and', len(X_test),'samples in test data\n')


# dataframe to write out comparison points of each models
df_results = pd.DataFrame(
        data=np.zeros(shape=(2,8)),
        columns = ['classifier',
                   'best_params',
                   'train_score', 
                   'test_score',
                   'Accuracy by Class',
                   'Precision by Class',
                   'FPR by Class',
                   'FNR by Class'])

# Loop through dictionary defined previously, fit each model, score data, create confusion matrix, do calculations and print out results
count = 0
for name, model in models.items():
            
        t_start = time.process_time()
    
        grid = GridSearchCV(model['type'], model['params'], refit=True, cv = 10, scoring = 'accuracy')
        estimator = grid.fit(X_train, Y_train)
        
        t_end = time.process_time()
        t_diff = t_end - t_start
        
        #score fitted model and save predictions for test dataset
        train_score = estimator.score(X_train, Y_train)
        test_score = estimator.score(X_test, Y_test)
        Y_pred = estimator.best_estimator_.predict(X_test)  
        
    
        cm = ConfusionMatrix(actual_vector=Y_test.to_numpy(),predict_vector=Y_pred) # create confusion matrix for below comparison points
        FP = np.array(list(cm.FP.values()))
        FN = np.array(list(cm.FN.values()))
        TP = np.array(list(cm.TP.values()))
        TN = np.array(list(cm.TN.values()))
        
        
        PPV = TP/(TP+FP) # Precision or positive predictive value for each target class
        
        FPR = FP/(FP+TN) # False Positive rate for each target class
        
        FNR = FN/(TP+FN) # False Negative rate for each target class

        ACC = (TP+TN)/(TP+FP+FN+TN) # Accuracy for each target class
        
        
        # Set the generated results into the results dataframe
        df_results.loc[count,'classifier'] = name
        df_results.loc[count,'best_params'] = str(estimator.best_params_)
        df_results.loc[count,'train_score'] = train_score
        df_results.loc[count,'test_score'] = test_score
        df_results.loc[count,'Accuracy by Class'] = str(ACC)
        df_results.loc[count,'Precision by Class'] = str(PPV)
        df_results.loc[count,'FPR by Class'] = str(FPR)
        df_results.loc[count,'FNR by Class'] = str(FNR)
        df_results.loc[count, '10-fold CV error estimate (w/ stderr)'] = (estimator.cv_results_.get('std_test_score').mean())/(math.sqrt(10))
        
        print("trained {c} in {f:.2f} s".format(c=name, f=t_diff))
        
        count += 1
697 samples in training data and 175 samples in test data

C:\Users\jantisseril\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning:

The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.

trained K - Nearest Neighbors in 1.19 s
trained Decision Tree in 0.30 s
trained Random Forest in 15.44 s
In [21]:
display(df_results.sort_values(by='test_score', ascending=False)) # View each model's results side by side to compare attributes
classifier best_params train_score test_score Accuracy by Class Precision by Class FPR by Class FNR by Class 10-fold CV error estimate (w/ stderr)
2 Random Forest {'n_estimators': 500} 1.000000 0.548571 [0.84 0.69714286 0.70857143 0.85142857] [0.71428571 0.40425532 0.33333333 0.7 ] [0.09302326 0.21374046 0.17647059 0.11627907] [0.34782609 0.56818182 0.69230769 0.23913043] 0.018704
1 Decision Tree {'max_depth': 3} 0.571019 0.485714 [0.80571429 0.61714286 0.70857143 0.84 ] [0.71428571 0.33802817 0.3125 0.70454545] [0.0620155 0.35877863 0.16176471 0.10077519] [0.56521739 0.45454545 0.74358974 0.32608696] 0.021783
0 K - Nearest Neighbors {'leaf_size': 3, 'n_neighbors': 50} 0.552367 0.474286 [0.78857143 0.64571429 0.66857143 0.84571429] [0.61538462 0.30434783 0.28888889 0.71111111] [0.11627907 0.24427481 0.23529412 0.10077519] [0.47826087 0.68181818 0.66666667 0.30434783] 0.019865
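
The per-class columns in the table follow the one-vs-rest formulas used in the training cell (PPV, FPR, FNR, ACC). Below is a minimal plain-Python sketch of those formulas; per_class_metrics is a hypothetical helper, and the 2-class confusion matrix is made up so the numbers are easy to verify by hand.

```python
def per_class_metrics(confusion, labels):
    """Compute one-vs-rest metrics per class.

    confusion[i][j] is the number of samples whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # actual k, predicted other
        fp = sum(row[k] for row in confusion) - tp   # predicted k, actual other
        tn = total - tp - fn - fp
        metrics[label] = {
            "precision": tp / (tp + fp),
            "fpr": fp / (fp + tn),
            "fnr": fn / (tp + fn),
            "accuracy": (tp + tn) / total,
        }
    return metrics

# Made-up confusion matrix for two classes.
confusion = [[8, 2],
             [1, 9]]
result = per_class_metrics(confusion, labels=[1, 2])
print(result[1])
```

The sketch assumes every class is both present and predicted at least once; otherwise one of the denominators is zero.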

Conclusion

The best performing machine learning models were Random Forest, Decision Tree, and K Nearest Neighbors. Before adjusting the hyperparameters, test scores averaged around 30% for each model; after tuning, the run shown above produced test scores between roughly 47% (K Nearest Neighbors) and 55% (Random Forest). As an example, the hyperparameters that we adjusted for K Nearest Neighbors were the number of neighbors and the leaf size. Although these scores are modest, the models are still somewhat useful: with four revenue groups of nearly equal size, always predicting the largest group would be correct for only about 27% of the movies. In future data analysis, we would hope to have more data in general as well as additional attribute columns. One example of a useful attribute would be each movie's budget, since budget would be an excellent indicator of how much money film investors might expect to make. Including this information, as well as any additional data available before a film's launch, would likely lead to much better prediction accuracy and stronger models overall.

We believe that our analysis and machine learning predictions are a good launching point for a deeper dive into this use case. Movie revenue prediction could become a very important factor in Hollywood, and this notebook represents a first step in that venture. Thank you!